library(readxl)
library(dplyr)
library(tidyr)
library(stats)
library(ggplot2)
library(car)
library(knitr)
library(rstatix)
library(evaluator)
library(multcomp)
library(vegan)
library(plotly)
library(ggthemes)
theme_set(theme_bw())

Data exploration

Case

Briefly about the article: in this work, the researchers investigated whether memantine can positively influence the learning ability of Down syndrome model mice (Ts65Dn, a mouse model of trisomy 21). They described the effects of memantine on protein expression in brain regions of Ts65Dn mice exposed to context fear conditioning (CFC).

For this purpose, the following experiment was set up: mice were placed in a novel cage, allowed to explore for three minutes, and then given an electric shock. These mice form the context-shock (CS) group and learn to associate the context with the aversive stimulus. Learning is displayed by “freezing” upon re-exposure to the context, where freezing is defined as a lack of movement. A second group of mice was placed in the novel cage, immediately given the electric shock, and then allowed to explore for three minutes. These mice form the shock-context (SC) group and do not acquire conditioned fear. All mice received an injection of memantine or an equivalent volume of saline 15 minutes prior to exposure to the novel context. Eight groups of mice (7–10 per group) were used: trisomic mice and controls, each split into CS and SC groups injected with saline or with memantine, as described below.

Experiment scheme

Classes:

  • c-CS-s: control mice, stimulated to learn, injected with saline (9 mice)

  • c-CS-m: control mice, stimulated to learn, injected with memantine (10 mice)

  • c-SC-s: control mice, not stimulated to learn, injected with saline (9 mice)

  • c-SC-m: control mice, not stimulated to learn, injected with memantine (10 mice)

  • t-CS-s: trisomy mice, stimulated to learn, injected with saline (7 mice)

  • t-CS-m: trisomy mice, stimulated to learn, injected with memantine (9 mice)

  • t-SC-s: trisomy mice, not stimulated to learn, injected with saline (9 mice)

  • t-SC-m: trisomy mice, not stimulated to learn, injected with memantine (9 mice)

Data description

First, let’s read the data and see how many observations we have.

mouse_data <- read_excel('Data_Cortex_Nuclear.xls')
count(mouse_data)
## # A tibble: 1 x 1
##       n
##   <int>
## 1  1080

We see that there are many more observations than mice used in the experiment (1080 vs 72). This means that repeated measurements were made for each mouse. Which mouse a measurement belongs to can be an important factor in further analysis, so we need to split the MouseID column into two: the ID itself and the measurement number.

mouses <-  mouse_data %>% separate(col = MouseID, into = c("MouseID", "measurement"), sep = "_")
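The split can be checked on a toy frame with IDs shaped like those in the dataset (the values below are made up for illustration):

```r
library(tidyr)

# Synthetic IDs of the form <mouse>_<measurement>, as in the real data
toy <- data.frame(MouseID = c("309_1", "309_2", "311_15"))
toy_split <- separate(toy, col = MouseID,
                      into = c("MouseID", "measurement"), sep = "_")
toy_split
#   MouseID measurement
# 1     309           1
# 2     309           2
# 3     311          15
```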

Let’s also examine the data structure.

str(mouses)
## tibble [1,080 × 83] (S3: tbl_df/tbl/data.frame)
##  $ MouseID        : chr [1:1080] "309" "309" "309" "309" ...
##  $ measurement    : chr [1:1080] "1" "2" "3" "4" ...
##  $ DYRK1A_N       : num [1:1080] 0.504 0.515 0.509 0.442 0.435 ...
##  $ ITSN1_N        : num [1:1080] 0.747 0.689 0.73 0.617 0.617 ...
##  $ BDNF_N         : num [1:1080] 0.43 0.412 0.418 0.359 0.359 ...
##  $ NR1_N          : num [1:1080] 2.82 2.79 2.69 2.47 2.37 ...
##  $ NR2A_N         : num [1:1080] 5.99 5.69 5.62 4.98 4.72 ...
##  $ pAKT_N         : num [1:1080] 0.219 0.212 0.209 0.223 0.213 ...
##  $ pBRAF_N        : num [1:1080] 0.178 0.173 0.176 0.176 0.174 ...
##  $ pCAMKII_N      : num [1:1080] 2.37 2.29 2.28 2.15 2.13 ...
##  $ pCREB_N        : num [1:1080] 0.232 0.227 0.23 0.207 0.192 ...
##  $ pELK_N         : num [1:1080] 1.75 1.6 1.56 1.6 1.5 ...
##  $ pERK_N         : num [1:1080] 0.688 0.695 0.677 0.583 0.551 ...
##  $ pJNK_N         : num [1:1080] 0.306 0.299 0.291 0.297 0.287 ...
##  $ PKCA_N         : num [1:1080] 0.403 0.386 0.381 0.377 0.364 ...
##  $ pMEK_N         : num [1:1080] 0.297 0.281 0.282 0.314 0.278 ...
##  $ pNR1_N         : num [1:1080] 1.022 0.957 1.004 0.875 0.865 ...
##  $ pNR2A_N        : num [1:1080] 0.606 0.588 0.602 0.52 0.508 ...
##  $ pNR2B_N        : num [1:1080] 1.88 1.73 1.73 1.57 1.48 ...
##  $ pPKCAB_N       : num [1:1080] 2.31 2.04 2.02 2.13 2.01 ...
##  $ pRSK_N         : num [1:1080] 0.442 0.445 0.468 0.478 0.483 ...
##  $ AKT_N          : num [1:1080] 0.859 0.835 0.814 0.728 0.688 ...
##  $ BRAF_N         : num [1:1080] 0.416 0.4 0.4 0.386 0.368 ...
##  $ CAMKII_N       : num [1:1080] 0.37 0.356 0.368 0.363 0.355 ...
##  $ CREB_N         : num [1:1080] 0.179 0.174 0.174 0.179 0.175 ...
##  $ ELK_N          : num [1:1080] 1.87 1.76 1.77 1.29 1.32 ...
##  $ ERK_N          : num [1:1080] 3.69 3.49 3.57 2.97 2.9 ...
##  $ GSK3B_N        : num [1:1080] 1.54 1.51 1.5 1.42 1.36 ...
##  $ JNK_N          : num [1:1080] 0.265 0.256 0.26 0.26 0.251 ...
##  $ MEK_N          : num [1:1080] 0.32 0.304 0.312 0.279 0.274 ...
##  $ TRKA_N         : num [1:1080] 0.814 0.781 0.785 0.734 0.703 ...
##  $ RSK_N          : num [1:1080] 0.166 0.157 0.161 0.162 0.155 ...
##  $ APP_N          : num [1:1080] 0.454 0.431 0.423 0.411 0.399 ...
##  $ Bcatenin_N     : num [1:1080] 3.04 2.92 2.94 2.5 2.46 ...
##  $ SOD1_N         : num [1:1080] 0.37 0.342 0.344 0.345 0.329 ...
##  $ MTOR_N         : num [1:1080] 0.459 0.424 0.425 0.429 0.409 ...
##  $ P38_N          : num [1:1080] 0.335 0.325 0.325 0.33 0.313 ...
##  $ pMTOR_N        : num [1:1080] 0.825 0.762 0.757 0.747 0.692 ...
##  $ DSCR1_N        : num [1:1080] 0.577 0.545 0.544 0.547 0.537 ...
##  $ AMPKA_N        : num [1:1080] 0.448 0.421 0.405 0.387 0.361 ...
##  $ NR2B_N         : num [1:1080] 0.586 0.545 0.553 0.548 0.513 ...
##  $ pNUMB_N        : num [1:1080] 0.395 0.368 0.364 0.367 0.352 ...
##  $ RAPTOR_N       : num [1:1080] 0.34 0.322 0.313 0.328 0.312 ...
##  $ TIAM1_N        : num [1:1080] 0.483 0.455 0.447 0.443 0.419 ...
##  $ pP70S6_N       : num [1:1080] 0.294 0.276 0.257 0.399 0.393 ...
##  $ NUMB_N         : num [1:1080] 0.182 0.182 0.184 0.162 0.16 ...
##  $ P70S6_N        : num [1:1080] 0.843 0.848 0.856 0.76 0.768 ...
##  $ pGSK3B_N       : num [1:1080] 0.193 0.195 0.201 0.184 0.186 ...
##  $ pPKCG_N        : num [1:1080] 1.44 1.44 1.52 1.61 1.65 ...
##  $ CDK5_N         : num [1:1080] 0.295 0.294 0.302 0.296 0.297 ...
##  $ S6_N           : num [1:1080] 0.355 0.355 0.386 0.291 0.309 ...
##  $ ADARB1_N       : num [1:1080] 1.34 1.31 1.28 1.2 1.21 ...
##  $ AcetylH3K9_N   : num [1:1080] 0.17 0.171 0.185 0.16 0.165 ...
##  $ RRP1_N         : num [1:1080] 0.159 0.158 0.149 0.166 0.161 ...
##  $ BAX_N          : num [1:1080] 0.189 0.185 0.191 0.185 0.188 ...
##  $ ARC_N          : num [1:1080] 0.106 0.107 0.108 0.103 0.105 ...
##  $ ERBB4_N        : num [1:1080] 0.145 0.15 0.145 0.141 0.142 ...
##  $ nNOS_N         : num [1:1080] 0.177 0.178 0.176 0.164 0.168 ...
##  $ Tau_N          : num [1:1080] 0.125 0.134 0.133 0.123 0.137 ...
##  $ GFAP_N         : num [1:1080] 0.115 0.118 0.118 0.117 0.116 ...
##  $ GluR3_N        : num [1:1080] 0.228 0.238 0.245 0.235 0.256 ...
##  $ GluR4_N        : num [1:1080] 0.143 0.142 0.142 0.145 0.141 ...
##  $ IL1B_N         : num [1:1080] 0.431 0.457 0.51 0.431 0.481 ...
##  $ P3525_N        : num [1:1080] 0.248 0.258 0.255 0.251 0.252 ...
##  $ pCASP9_N       : num [1:1080] 1.6 1.67 1.66 1.48 1.53 ...
##  $ PSD95_N        : num [1:1080] 2.01 2 2.02 1.96 2.01 ...
##  $ SNCA_N         : num [1:1080] 0.108 0.11 0.108 0.12 0.12 ...
##  $ Ubiquitin_N    : num [1:1080] 1.045 1.01 0.997 0.99 0.998 ...
##  $ pGSK3B_Tyr216_N: num [1:1080] 0.832 0.849 0.847 0.833 0.879 ...
##  $ SHH_N          : num [1:1080] 0.189 0.2 0.194 0.192 0.206 ...
##  $ BAD_N          : num [1:1080] 0.123 0.117 0.119 0.133 0.13 ...
##  $ BCL2_N         : num [1:1080] NA NA NA NA NA NA NA NA NA NA ...
##  $ pS6_N          : num [1:1080] 0.106 0.107 0.108 0.103 0.105 ...
##  $ pCFOS_N        : num [1:1080] 0.108 0.104 0.106 0.111 0.111 ...
##  $ SYP_N          : num [1:1080] 0.427 0.442 0.436 0.392 0.434 ...
##  $ H3AcK18_N      : num [1:1080] 0.115 0.112 0.112 0.13 0.118 ...
##  $ EGR1_N         : num [1:1080] 0.132 0.135 0.133 0.147 0.14 ...
##  $ H3MeK4_N       : num [1:1080] 0.128 0.131 0.127 0.147 0.148 ...
##  $ CaNA_N         : num [1:1080] 1.68 1.74 1.93 1.7 1.84 ...
##  $ Genotype       : chr [1:1080] "Control" "Control" "Control" "Control" ...
##  $ Treatment      : chr [1:1080] "Memantine" "Memantine" "Memantine" "Memantine" ...
##  $ Behavior       : chr [1:1080] "C/S" "C/S" "C/S" "C/S" ...
##  $ class          : chr [1:1080] "c-CS-m" "c-CS-m" "c-CS-m" "c-CS-m" ...

We see that columns such as ID, genotype, treatment, behavior, and class are stored as character; it is better to convert them to factors.

mouses$class <- as.factor(mouses$class)
mouses$Behavior <- as.factor(mouses$Behavior)
mouses$Treatment <- as.factor(mouses$Treatment)
mouses$Genotype <- as.factor(mouses$Genotype)
mouses$MouseID <- as.factor(mouses$MouseID)
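The same conversion can be written more compactly with dplyr’s `across()`; a sketch on a toy frame (made-up values) gives an identical result:

```r
library(dplyr)

toy <- data.frame(Genotype  = c("Control", "Ts65Dn"),
                  Treatment = c("Memantine", "Saline"),
                  value     = c(0.4, 0.5))
# Convert all listed columns to factors in one step; numeric columns are untouched
toy <- toy %>%
  mutate(across(c(Genotype, Treatment), as.factor))
str(toy)
```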

Let’s also take a look at how the classes are represented. The counts show that the groups are more or less balanced.

class_count <- mouses %>% group_by(class) %>%  count() 
class_count
## # A tibble: 8 x 2
## # Groups:   class [8]
##   class      n
##   <fct>  <int>
## 1 c-CS-m   150
## 2 c-CS-s   135
## 3 c-SC-m   150
## 4 c-SC-s   135
## 5 t-CS-m   135
## 6 t-CS-s   105
## 7 t-SC-m   135
## 8 t-SC-s   135

Let’s also see how many missing values there are in the data and for which variables. We will also find out how many complete observations are in the dataset.

mouse_na <- colSums(is.na(mouses))

sum(is.na(mouses))
## [1] 1396
as.table(mouse_na)
##         MouseID     measurement        DYRK1A_N         ITSN1_N          BDNF_N 
##               0               0               3               3               3 
##           NR1_N          NR2A_N          pAKT_N         pBRAF_N       pCAMKII_N 
##               3               3               3               3               3 
##         pCREB_N          pELK_N          pERK_N          pJNK_N          PKCA_N 
##               3               3               3               3               3 
##          pMEK_N          pNR1_N         pNR2A_N         pNR2B_N        pPKCAB_N 
##               3               3               3               3               3 
##          pRSK_N           AKT_N          BRAF_N        CAMKII_N          CREB_N 
##               3               3               3               3               3 
##           ELK_N           ERK_N         GSK3B_N           JNK_N           MEK_N 
##              18               3               3               3               7 
##          TRKA_N           RSK_N           APP_N      Bcatenin_N          SOD1_N 
##               3               3               3              18               3 
##          MTOR_N           P38_N         pMTOR_N         DSCR1_N         AMPKA_N 
##               3               3               3               3               3 
##          NR2B_N         pNUMB_N        RAPTOR_N         TIAM1_N        pP70S6_N 
##               3               3               3               3               3 
##          NUMB_N         P70S6_N        pGSK3B_N         pPKCG_N          CDK5_N 
##               0               0               0               0               0 
##            S6_N        ADARB1_N    AcetylH3K9_N          RRP1_N           BAX_N 
##               0               0               0               0               0 
##           ARC_N         ERBB4_N          nNOS_N           Tau_N          GFAP_N 
##               0               0               0               0               0 
##         GluR3_N         GluR4_N          IL1B_N         P3525_N        pCASP9_N 
##               0               0               0               0               0 
##         PSD95_N          SNCA_N     Ubiquitin_N pGSK3B_Tyr216_N           SHH_N 
##               0               0               0               0               0 
##           BAD_N          BCL2_N           pS6_N         pCFOS_N           SYP_N 
##             213             285               0              75               0 
##       H3AcK18_N          EGR1_N        H3MeK4_N          CaNA_N        Genotype 
##             180             210             270               0               0 
##       Treatment        Behavior           class 
##               0               0               0
mouses_without_na <- na.omit(mouses)
count(mouses_without_na)
## # A tibble: 1 x 1
##       n
##   <int>
## 1   552

Are there differences in the level of BDNF_N production depending on the class in the experiment?

There are 8 classes of mice in this dataset; 15 technical replicates were carried out for each of the 72 mice. For pairwise comparison of this many classes, ANOVA is appropriate. I decided to consider both the class and the individual mouse, since individual characteristics play an important role in the experiment. I did not use each of the class components (genotype, behavior, treatment) separately, otherwise the analysis would become very cumbersome. Unfortunately, I was unable to run an ANOVA with the model lm(BDNF_N ~ class + MouseID, data = mouses). However, I think it is incorrect to carry out the analysis without taking into account that the data contain repeated measurements. Below I give the ANOVA results both ignoring the repeated measurements and accounting for them (for the latter, I took the average value for each mouse).
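The averaging approach described above can be sketched on toy data (hypothetical numbers: two classes, two mice per class, three technical replicates per mouse):

```r
library(dplyr)

set.seed(7)
toy <- data.frame(
  class   = rep(c("A", "B"), each = 6),
  MouseID = factor(rep(1:4, each = 3)),
  value   = rnorm(12, mean = rep(c(0, 1), each = 6))
)
# Collapse the technical replicates to one mean per mouse,
# then run a one-way ANOVA on the per-mouse means
per_mouse <- toy %>%
  group_by(class, MouseID) %>%
  summarise(value = mean(value), .groups = "drop")
summary(aov(value ~ class, data = per_mouse))
```

This way the mouse, not the replicate, is the experimental unit, which is why the averaged analysis has far fewer degrees of freedom.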

Check assumptions

First, let’s group the data by the class and MouseID variables and look at the summary statistics for the variable BDNF_N. I will also use these statistics for the ANOVA based on averaged BDNF_N expression values.

BDNF_mean <- mouses %>%
  group_by(class, MouseID) %>%
  get_summary_stats(BDNF_N, type = "mean_sd")
BDNF_mean
## # A tibble: 72 x 6
##    MouseID class  variable     n  mean    sd
##    <fct>   <fct>  <chr>    <dbl> <dbl> <dbl>
##  1 309     c-CS-m BDNF_N      15 0.356 0.038
##  2 311     c-CS-m BDNF_N      15 0.335 0.033
##  3 320     c-CS-m BDNF_N      15 0.399 0.037
##  4 321     c-CS-m BDNF_N      15 0.359 0.033
##  5 322     c-CS-m BDNF_N      15 0.315 0.013
##  6 3415    c-CS-m BDNF_N      15 0.315 0.055
##  7 3499    c-CS-m BDNF_N      15 0.288 0.044
##  8 3507    c-CS-m BDNF_N      15 0.317 0.017
##  9 3520    c-CS-m BDNF_N      15 0.341 0.033
## 10 3521    c-CS-m BDNF_N      15 0.366 0.045
## # … with 62 more rows

Let’s remove the missing values of BDNF_N (we can do this since there are only 3 missing observations) and plot the mean BDNF_N with 95% confidence intervals, colored by class. We will not display individual mice on the graph, to avoid cluttering it.

BDNF_N_without_NA <- mouses  %>% filter( !is.na(BDNF_N)) 
ggplot(BDNF_N_without_NA, aes(class, BDNF_N, color = class)) + stat_summary(fun.data = "mean_cl_normal") + ggtitle(label = "BDNF expression versus class plot")

Let’s build linear models of the dependence of BDNF_N expression on class (one for the full data, one for the per-mouse averages) and use them to check the applicability conditions, since ANOVA shares its assumptions with linear models.

fit <- lm(BDNF_N ~ class , BDNF_N_without_NA)
fit2 <- lm(mean ~ class, BDNF_mean)

We see that when only the mouse class is taken into account, the residual plot looks good and the distribution looks normal. For the averaged values, the residual plot raises doubts: the medians are shifted and the dispersion differs noticeably between some groups.

simple_diag <- fortify(fit) 
simple_diag2 <- fortify(fit2) 

ggplot(simple_diag, aes(x = 1:nrow(simple_diag), y = .cooksd)) +
  geom_bar(stat = 'identity')

ggplot(simple_diag2, aes(x = 1:nrow(simple_diag2), y = .cooksd)) +
  geom_bar(stat = 'identity')

ggplot(data = simple_diag, aes(x = class, y = .stdresid, colour = class)) +
  geom_boxplot() + geom_hline(yintercept = 0)

ggplot(data = simple_diag2, aes(x = class, y = .stdresid, colour = class)) +
  geom_boxplot() + geom_hline(yintercept = 0)

qqPlot(fit, id = FALSE) 

qqPlot(fit2, id = FALSE) 

Computation

In both cases, the ANOVA results indicate a difference in protein expression depending on the mouse class. However, with the averaged values the p-value is close to 0.05, which suggests the difference may be marginal. For more accurate conclusions, we need to look at the post-hoc tests.

res.aov <- anova_test(
  data = BDNF_N_without_NA, dv = BDNF_N, wid = MouseID,
  between = class
  )
res.aov
## ANOVA Table (type II tests)
## 
##   Effect DFn  DFd      F        p p<.05  ges
## 1  class   7 1069 18.816 8.86e-24     * 0.11
res.aov2 <- anova_test(
  data = BDNF_mean, dv = mean, wid = MouseID,
  between = class
  )
res.aov2
## ANOVA Table (type II tests)
## 
##   Effect DFn DFd    F     p p<.05   ges
## 1  class   7  64 2.21 0.045     * 0.195

The post-hoc tests for the model built without taking replicates into account show far more class pairs that differ in protein expression than the model built on averaged values. That is, ignoring the individual characteristics of the mice, we would find more apparent differences in BDNF expression. In fact, we can speak of a significant difference in BDNF expression only between the groups c-SC-m and c-CS-s. The groups c-SC-m and c-CS-m are also close to a significant difference (p-value = 0.06). This agrees well with the “BDNF expression versus class plot”. It can be assumed that these differences are due to the context of the experiment rather than the use of memantine.

fit_inter <- lm(BDNF_N ~ class , data = BDNF_N_without_NA)
dat_tukey <- glht(fit_inter, linfct = mcp(class = 'Tukey'))
summary(dat_tukey)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: lm(formula = BDNF_N ~ class, data = BDNF_N_without_NA)
## 
## Linear Hypotheses:
##                        Estimate Std. Error t value Pr(>|t|)    
## c-CS-s - c-CS-m == 0  0.0030979  0.0055459   0.559   0.9993    
## c-SC-m - c-CS-m == 0 -0.0482717  0.0053980  -8.942    <0.01 ***
## c-SC-s - c-CS-m == 0 -0.0258249  0.0055459  -4.657    <0.01 ***
## t-CS-m - c-CS-m == 0 -0.0264852  0.0055459  -4.776    <0.01 ***
## t-CS-s - c-CS-m == 0 -0.0337570  0.0059483  -5.675    <0.01 ***
## t-SC-m - c-CS-m == 0 -0.0181541  0.0055459  -3.273   0.0242 *  
## t-SC-s - c-CS-m == 0 -0.0136310  0.0055790  -2.443   0.2213    
## c-SC-m - c-CS-s == 0 -0.0513696  0.0055459  -9.263    <0.01 ***
## c-SC-s - c-CS-s == 0 -0.0289228  0.0056900  -5.083    <0.01 ***
## t-CS-m - c-CS-s == 0 -0.0295831  0.0056900  -5.199    <0.01 ***
## t-CS-s - c-CS-s == 0 -0.0368549  0.0060829  -6.059    <0.01 ***
## t-SC-m - c-CS-s == 0 -0.0212520  0.0056900  -3.735    <0.01 ** 
## t-SC-s - c-CS-s == 0 -0.0167289  0.0057223  -2.923   0.0689 .  
## c-SC-s - c-SC-m == 0  0.0224468  0.0055459   4.047    <0.01 ** 
## t-CS-m - c-SC-m == 0  0.0217865  0.0055459   3.928    <0.01 ** 
## t-CS-s - c-SC-m == 0  0.0145147  0.0059483   2.440   0.2227    
## t-SC-m - c-SC-m == 0  0.0301176  0.0055459   5.431    <0.01 ***
## t-SC-s - c-SC-m == 0  0.0346406  0.0055790   6.209    <0.01 ***
## t-CS-m - c-SC-s == 0 -0.0006603  0.0056900  -0.116   1.0000    
## t-CS-s - c-SC-s == 0 -0.0079321  0.0060829  -1.304   0.8972    
## t-SC-m - c-SC-s == 0  0.0076708  0.0056900   1.348   0.8797    
## t-SC-s - c-SC-s == 0  0.0121939  0.0057223   2.131   0.3951    
## t-CS-s - t-CS-m == 0 -0.0072718  0.0060829  -1.195   0.9332    
## t-SC-m - t-CS-m == 0  0.0083311  0.0056900   1.464   0.8260    
## t-SC-s - t-CS-m == 0  0.0128542  0.0057223   2.246   0.3244    
## t-SC-m - t-CS-s == 0  0.0156029  0.0060829   2.565   0.1697    
## t-SC-s - t-CS-s == 0  0.0201260  0.0061130   3.292   0.0228 *  
## t-SC-s - t-SC-m == 0  0.0045231  0.0057223   0.790   0.9936    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
fit_inter2 <- lm(mean ~ class , data = BDNF_mean)
dat_tukey2 <- glht(fit_inter2, linfct = mcp(class = 'Tukey'))
summary(dat_tukey2)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: lm(formula = mean ~ class, data = BDNF_mean)
## 
## Linear Hypotheses:
##                        Estimate Std. Error t value Pr(>|t|)  
## c-CS-s - c-CS-m == 0  0.0032333  0.0162035   0.200   1.0000  
## c-SC-m - c-CS-m == 0 -0.0482000  0.0157713  -3.056   0.0611 .
## c-SC-s - c-CS-m == 0 -0.0257667  0.0162035  -1.590   0.7538  
## t-CS-m - c-CS-m == 0 -0.0264333  0.0162035  -1.631   0.7294  
## t-CS-s - c-CS-m == 0 -0.0335286  0.0173792  -1.929   0.5358  
## t-SC-m - c-CS-m == 0 -0.0179889  0.0162035  -1.110   0.9522  
## t-SC-s - c-CS-m == 0 -0.0129889  0.0162035  -0.802   0.9924  
## c-SC-m - c-CS-s == 0 -0.0514333  0.0162035  -3.174   0.0446 *
## c-SC-s - c-CS-s == 0 -0.0290000  0.0166245  -1.744   0.6579  
## t-CS-m - c-CS-s == 0 -0.0296667  0.0166245  -1.785   0.6320  
## t-CS-s - c-CS-s == 0 -0.0367619  0.0177723  -2.068   0.4451  
## t-SC-m - c-CS-s == 0 -0.0212222  0.0166245  -1.277   0.9039  
## t-SC-s - c-CS-s == 0 -0.0162222  0.0166245  -0.976   0.9762  
## c-SC-s - c-SC-m == 0  0.0224333  0.0162035   1.384   0.8606  
## t-CS-m - c-SC-m == 0  0.0217667  0.0162035   1.343   0.8782  
## t-CS-s - c-SC-m == 0  0.0146714  0.0173792   0.844   0.9897  
## t-SC-m - c-SC-m == 0  0.0302111  0.0162035   1.864   0.5791  
## t-SC-s - c-SC-m == 0  0.0352111  0.0162035   2.173   0.3809  
## t-CS-m - c-SC-s == 0 -0.0006667  0.0166245  -0.040   1.0000  
## t-CS-s - c-SC-s == 0 -0.0077619  0.0177723  -0.437   0.9998  
## t-SC-m - c-SC-s == 0  0.0077778  0.0166245   0.468   0.9998  
## t-SC-s - c-SC-s == 0  0.0127778  0.0166245   0.769   0.9941  
## t-CS-s - t-CS-m == 0 -0.0070952  0.0177723  -0.399   0.9999  
## t-SC-m - t-CS-m == 0  0.0084444  0.0166245   0.508   0.9996  
## t-SC-s - t-CS-m == 0  0.0134444  0.0166245   0.809   0.9920  
## t-SC-m - t-CS-s == 0  0.0155397  0.0177723   0.874   0.9873  
## t-SC-s - t-CS-s == 0  0.0205397  0.0177723   1.156   0.9412  
## t-SC-s - t-SC-m == 0  0.0050000  0.0166245   0.301   1.0000  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
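For reference, base R’s `TukeyHSD()` on an `aov` fit gives a comparable family of Tukey-adjusted pairwise comparisons without the multcomp dependency; a minimal sketch on toy data (made-up values):

```r
set.seed(3)
toy <- data.frame(
  g = factor(rep(c("a", "b", "c"), each = 10)),
  y = rnorm(30, mean = rep(c(0, 0, 2), each = 10))
)
tk <- TukeyHSD(aov(y ~ g, data = toy))
tk$g  # one row per pairwise contrast: diff, lwr, upr, adjusted p-value
```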

Predict the production level of ERBB4_N protein based on data on other proteins

Full linear model

Let’s build a linear model of the dependence of ERBB4_N expression on the other proteins. We can notice that one protein, PSD95_N, strongly influences the expression of the protein of interest. Three more proteins with low p-values can also be distinguished: BAX_N, Tau_N, and pPKCG_N.

lin_mod <- lm(ERBB4_N ~ . - MouseID - measurement - Genotype - Treatment -Behavior - class  , mouses)
summary(lin_mod)
## 
## Call:
## lm(formula = ERBB4_N ~ . - MouseID - measurement - Genotype - 
##     Treatment - Behavior - class, data = mouses)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.023037 -0.003370 -0.000127  0.003071  0.031955 
## 
## Coefficients: (1 not defined because of singularities)
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.566e-02  8.245e-03   3.112 0.001970 ** 
## DYRK1A_N        -6.824e-03  7.475e-03  -0.913 0.361710    
## ITSN1_N          8.127e-03  1.058e-02   0.768 0.442776    
## BDNF_N           3.833e-02  1.919e-02   1.997 0.046378 *  
## NR1_N           -1.409e-02  6.685e-03  -2.108 0.035544 *  
## NR2A_N          -1.049e-04  1.361e-03  -0.077 0.938610    
## pAKT_N           7.961e-02  3.044e-02   2.615 0.009203 ** 
## pBRAF_N         -4.103e-02  3.065e-02  -1.339 0.181356    
## pCAMKII_N       -1.451e-05  7.954e-04  -0.018 0.985451    
## pCREB_N         -6.015e-02  2.910e-02  -2.067 0.039290 *  
## pELK_N           4.731e-05  1.030e-03   0.046 0.963391    
## pERK_N          -1.987e-02  7.887e-03  -2.519 0.012109 *  
## pJNK_N          -6.035e-02  2.538e-02  -2.378 0.017814 *  
## PKCA_N           8.377e-03  2.736e-02   0.306 0.759625    
## pMEK_N           8.498e-03  2.706e-02   0.314 0.753676    
## pNR1_N          -2.566e-02  1.334e-02  -1.924 0.054968 .  
## pNR2A_N          1.004e-02  7.239e-03   1.388 0.165927    
## pNR2B_N          1.928e-02  6.659e-03   2.896 0.003959 ** 
## pPKCAB_N         5.727e-03  2.773e-03   2.065 0.039457 *  
## pRSK_N           1.001e-02  1.427e-02   0.701 0.483573    
## AKT_N           -6.738e-04  8.183e-03  -0.082 0.934413    
## BRAF_N           1.550e-02  1.213e-02   1.278 0.201926    
## CAMKII_N        -3.429e-03  2.011e-02  -0.170 0.864700    
## CREB_N          -1.450e-02  3.091e-02  -0.469 0.639135    
## ELK_N            3.510e-03  3.719e-03   0.944 0.345811    
## ERK_N            4.660e-03  2.804e-03   1.662 0.097208 .  
## GSK3B_N         -5.840e-03  6.824e-03  -0.856 0.392588    
## JNK_N           -1.188e-02  3.378e-02  -0.352 0.725317    
## MEK_N            8.785e-03  2.197e-02   0.400 0.689510    
## TRKA_N           5.722e-03  1.646e-02   0.348 0.728257    
## RSK_N           -2.299e-02  3.915e-02  -0.587 0.557214    
## APP_N           -5.345e-03  1.825e-02  -0.293 0.769751    
## Bcatenin_N       1.107e-03  5.298e-03   0.209 0.834546    
## SOD1_N          -6.241e-03  3.985e-03  -1.566 0.118018    
## MTOR_N           4.161e-02  1.651e-02   2.520 0.012048 *  
## P38_N           -1.619e-02  1.348e-02  -1.201 0.230267    
## pMTOR_N         -9.530e-03  7.667e-03  -1.243 0.214475    
## DSCR1_N         -6.849e-03  1.023e-02  -0.670 0.503374    
## AMPKA_N          2.522e-02  2.294e-02   1.099 0.272263    
## NR2B_N           7.391e-03  9.146e-03   0.808 0.419449    
## pNUMB_N         -6.467e-03  2.006e-02  -0.322 0.747326    
## RAPTOR_N         4.750e-02  2.465e-02   1.927 0.054629 .  
## TIAM1_N         -3.284e-02  1.956e-02  -1.679 0.093892 .  
## pP70S6_N         2.739e-03  5.307e-03   0.516 0.606022    
## NUMB_N          -3.304e-02  3.784e-02  -0.873 0.383042    
## P70S6_N         -4.357e-03  5.119e-03  -0.851 0.395118    
## pGSK3B_N         1.291e-01  4.063e-02   3.179 0.001577 ** 
## pPKCG_N         -7.636e-03  1.962e-03  -3.893 0.000113 ***
## CDK5_N           7.111e-04  9.807e-03   0.073 0.942225    
## S6_N             1.566e-02  7.647e-03   2.049 0.041059 *  
## ADARB1_N        -2.609e-03  2.040e-03  -1.279 0.201557    
## AcetylH3K9_N     2.113e-03  1.103e-02   0.192 0.848199    
## RRP1_N          -1.488e-02  1.038e-02  -1.434 0.152232    
## BAX_N           -1.304e-01  3.542e-02  -3.680 0.000260 ***
## ARC_N            1.687e-01  6.748e-02   2.499 0.012775 *  
## nNOS_N          -4.749e-03  2.854e-02  -0.166 0.867918    
## Tau_N            7.082e-02  1.810e-02   3.913 0.000104 ***
## GFAP_N          -4.703e-02  5.400e-02  -0.871 0.384291    
## GluR3_N         -7.133e-03  2.436e-02  -0.293 0.769812    
## GluR4_N         -7.030e-02  3.373e-02  -2.084 0.037672 *  
## IL1B_N           2.845e-02  1.081e-02   2.631 0.008794 ** 
## P3525_N          4.994e-02  2.423e-02   2.061 0.039810 *  
## pCASP9_N         8.118e-03  3.044e-03   2.667 0.007923 ** 
## PSD95_N          2.379e-02  3.811e-03   6.242 9.55e-10 ***
## SNCA_N          -1.137e-02  3.484e-02  -0.326 0.744195    
## Ubiquitin_N      2.330e-03  4.883e-03   0.477 0.633468    
## pGSK3B_Tyr216_N  2.808e-02  1.006e-02   2.793 0.005439 ** 
## SHH_N            4.904e-02  1.998e-02   2.454 0.014471 *  
## BAD_N           -3.592e-02  3.508e-02  -1.024 0.306311    
## BCL2_N           6.173e-03  3.112e-02   0.198 0.842840    
## pS6_N                   NA         NA      NA       NA    
## pCFOS_N         -1.232e-02  3.111e-02  -0.396 0.692240    
## SYP_N            1.354e-02  1.079e-02   1.254 0.210395    
## H3AcK18_N        2.401e-03  2.527e-02   0.095 0.924349    
## EGR1_N          -6.016e-03  2.619e-02  -0.230 0.818406    
## H3MeK4_N         1.227e-02  2.154e-02   0.570 0.569097    
## CaNA_N          -1.749e-03  3.419e-03  -0.512 0.609159    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.005885 on 476 degrees of freedom
##   (528 observations deleted due to missingness)
## Multiple R-squared:  0.8641, Adjusted R-squared:  0.8427 
## F-statistic: 40.37 on 75 and 476 DF,  p-value: < 2.2e-16

Diagnostics

In theory, based on this model, we could talk about the influence of some proteins on ERBB4_N expression. But first we must check whether it is correct to use such a model, so we will carry out model diagnostics.

Multicollinearity check

I strongly suspect that we will observe multicollinearity in this model, since proteins do not work separately but are linked to each other in molecular cascades.

At first, I ran into a difficulty when checking for multicollinearity: there are aliased coefficients in the model. This happens when predictors are perfectly multicollinear. Such a predictor can be identified with alias():

alias( lm( ERBB4_N ~ . - MouseID - measurement - Genotype - Treatment -Behavior - class , mouses ) )
## Model :
## ERBB4_N ~ (MouseID + measurement + DYRK1A_N + ITSN1_N + BDNF_N + 
##     NR1_N + NR2A_N + pAKT_N + pBRAF_N + pCAMKII_N + pCREB_N + 
##     pELK_N + pERK_N + pJNK_N + PKCA_N + pMEK_N + pNR1_N + pNR2A_N + 
##     pNR2B_N + pPKCAB_N + pRSK_N + AKT_N + BRAF_N + CAMKII_N + 
##     CREB_N + ELK_N + ERK_N + GSK3B_N + JNK_N + MEK_N + TRKA_N + 
##     RSK_N + APP_N + Bcatenin_N + SOD1_N + MTOR_N + P38_N + pMTOR_N + 
##     DSCR1_N + AMPKA_N + NR2B_N + pNUMB_N + RAPTOR_N + TIAM1_N + 
##     pP70S6_N + NUMB_N + P70S6_N + pGSK3B_N + pPKCG_N + CDK5_N + 
##     S6_N + ADARB1_N + AcetylH3K9_N + RRP1_N + BAX_N + ARC_N + 
##     nNOS_N + Tau_N + GFAP_N + GluR3_N + GluR4_N + IL1B_N + P3525_N + 
##     pCASP9_N + PSD95_N + SNCA_N + Ubiquitin_N + pGSK3B_Tyr216_N + 
##     SHH_N + BAD_N + BCL2_N + pS6_N + pCFOS_N + SYP_N + H3AcK18_N + 
##     EGR1_N + H3MeK4_N + CaNA_N + Genotype + Treatment + Behavior + 
##     class) - MouseID - measurement - Genotype - Treatment - Behavior - 
##     class
## 
## Complete :
##       (Intercept) DYRK1A_N ITSN1_N BDNF_N NR1_N NR2A_N pAKT_N pBRAF_N pCAMKII_N
## pS6_N 0           0        0       0      0     0      0      0       0        
##       pCREB_N pELK_N pERK_N pJNK_N PKCA_N pMEK_N pNR1_N pNR2A_N pNR2B_N
## pS6_N 0       0      0      0      0      0      0      0       0      
##       pPKCAB_N pRSK_N AKT_N BRAF_N CAMKII_N CREB_N ELK_N ERK_N GSK3B_N JNK_N
## pS6_N 0        0      0     0      0        0      0     0     0       0    
##       MEK_N TRKA_N RSK_N APP_N Bcatenin_N SOD1_N MTOR_N P38_N pMTOR_N DSCR1_N
## pS6_N 0     0      0     0     0          0      0      0     0       0      
##       AMPKA_N NR2B_N pNUMB_N RAPTOR_N TIAM1_N pP70S6_N NUMB_N P70S6_N pGSK3B_N
## pS6_N 0       0      0       0        0       0        0      0       0       
##       pPKCG_N CDK5_N S6_N ADARB1_N AcetylH3K9_N RRP1_N BAX_N ARC_N nNOS_N Tau_N
## pS6_N 0       0      0    0        0            0      0     1     0      0    
##       GFAP_N GluR3_N GluR4_N IL1B_N P3525_N pCASP9_N PSD95_N SNCA_N Ubiquitin_N
## pS6_N 0      0       0       0      0       0        0       0      0          
##       pGSK3B_Tyr216_N SHH_N BAD_N BCL2_N pCFOS_N SYP_N H3AcK18_N EGR1_N
## pS6_N 0               0     0     0      0       0     0         0     
##       H3MeK4_N CaNA_N
## pS6_N 0        0
lin_mod <- update(lin_mod, .~. - pS6_N)
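A minimal illustration of what alias() detects, on toy data (made-up values): when one predictor is an exact linear combination of others, lm() drops it (its coefficient becomes NA) and alias() reports the dependency.

```r
set.seed(1)
d <- data.frame(x = rnorm(10), y = rnorm(10))
d$z   <- d$x + d$y          # z is perfectly collinear with x and y
d$out <- rnorm(10)
fit <- lm(out ~ x + y + z, data = d)
alias(fit)$Complete         # reports z as 1*x + 1*y
```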

Now we can continue diagnostics

head(vif(lin_mod))
## DYRK1A_N  ITSN1_N   BDNF_N    NR1_N   NR2A_N   pAKT_N 
## 23.73661 59.47364 16.66961 96.31168 25.12792 21.88013

As expected, we see multicollinearity among the majority of predictors, with some very large VIF values (96 for NR1_N!).
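The VIF reported by car::vif() is simply 1/(1 − R²) of each predictor regressed on all the others, which can be checked by hand on toy data (made-up values):

```r
set.seed(42)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)  # x2 nearly duplicates x1
# VIF of x1 by hand: regress x1 on the other predictor(s)
r2 <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2)
vif_x1 > 10  # severe multicollinearity by the usual rule of thumb
```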

Outliers

On the residual plot and the Q-Q plot, we see outliers.

lin_mod_diag <- data.frame(fortify(lin_mod))  # keep the model object; store diagnostics separately

gg_resid <- ggplot(data = lin_mod_diag, aes(x = .fitted, y = .stdresid)) + 
  geom_point() + 
  geom_hline(yintercept = 0) +
  geom_smooth(method = "lm") +
  geom_hline(yintercept = 2, color = "red") +
  geom_hline(yintercept = -2, color = "red") +
  ggtitle("Plot of residuals from predicted values")
gg_resid
## `geom_smooth()` using formula 'y ~ x'

qqPlot(lin_mod_diag$.stdresid)

## [1]  37 389

Correction

I think it is better to build a linear model using only the most significant predictors.

lin_mod2 <- lm(ERBB4_N ~ PSD95_N + BAX_N + Tau_N + pPKCG_N, data = mouses)
summary(lin_mod2)
## 
## Call:
## lm(formula = ERBB4_N ~ PSD95_N + BAX_N + Tau_N + pPKCG_N, data = mouses)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.033795 -0.006833 -0.000175  0.006176  0.050748 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0584628  0.0038119  15.337  < 2e-16 ***
## PSD95_N      0.0334146  0.0013381  24.971  < 2e-16 ***
## BAX_N        0.0679042  0.0183684   3.697 0.000229 ***
## Tau_N        0.0727048  0.0048728  14.921  < 2e-16 ***
## pPKCG_N     -0.0024341  0.0005757  -4.228 2.55e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01072 on 1075 degrees of freedom
## Multiple R-squared:  0.4955, Adjusted R-squared:  0.4936 
## F-statistic:   264 on 4 and 1075 DF,  p-value: < 2.2e-16

Here multicollinearity is absent; however, the residual plot still looks poor, with many outliers.

vif(lin_mod2)
##  PSD95_N    BAX_N    Tau_N  pPKCG_N 
## 1.087134 1.121878 1.060838 1.040219
lin_mod2 <- data.frame(fortify(lin_mod2))

gg_resid2 <- ggplot(data = lin_mod2, aes(x = .fitted, y = .stdresid)) + 
  geom_point() + 
  geom_hline(yintercept = 0) +
  geom_smooth(method = "lm") +
  geom_hline(yintercept = 2, color = "red") +
  geom_hline(yintercept = -2, color = "red") +
   ggtitle("Plot of residuals from predicted values")
gg_resid2
## `geom_smooth()` using formula 'y ~ x'

qqPlot(lin_mod2$.stdresid)

## [1] 564 794
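The red ±2 bands on the residual plots can also be checked numerically: rstandard() returns standardized residuals, and under approximately normal errors only about 5% of points should fall outside ±2. A self-contained sketch on synthetic data (for the real model, `fit` would be lin_mod2 before it is overwritten by fortify()):

```r
set.seed(7)
d <- data.frame(x = rnorm(200))
d$y <- 2 * d$x + rnorm(200)
fit <- lm(y ~ x, data = d)

# Indices of observations whose standardized residuals lie outside +/- 2
outliers <- which(abs(rstandard(fit)) > 2)
length(outliers)  # roughly 5% of 200 under well-behaved errors
```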

Visualization

We visualize the predicted values on an artificial dataset in which the most significant predictor, PSD95_N, ranges from its minimum to its maximum observed value, while the remaining predictors are held at their means.

model <- lm(ERBB4_N ~ PSD95_N + BAX_N + Tau_N + pPKCG_N, data = mouses)

# Dataset for model prediction
MyData <- data.frame(
  PSD95_N = seq(min(mouses$PSD95_N), max(mouses$PSD95_N), length.out = 500),
  BAX_N = mean(mouses$BAX_N),
  pPKCG_N = mean(mouses$pPKCG_N),
  Tau_N = mean(mouses$Tau_N)
  )

# Predicted values
Predictions <- predict(model, newdata = MyData,  interval = 'confidence')
MyData <- data.frame(MyData, Predictions)


# Model prediction plot
Pl_predict <- ggplot(MyData, aes(x = PSD95_N, y = fit)) +
  geom_ribbon(alpha = 0.2, aes(ymin = lwr, ymax = upr)) +
  geom_line() + 
  geom_point(data= mouses, alpha = 0.2, color = 'plum4' , aes(y = ERBB4_N)) +
  ggtitle("Multiple model prediction plot") +
  labs(y="ERBB4_N - fitted")
Pl_predict 

As a result, I built a model that predicts the expression of ERBB4_N from the expression of only a few of the most significant proteins. Since the model does not meet the applicability conditions, I would not recommend using it.

PCA

Principal component analysis allows us to reduce the dimensionality of the data, describe the relationships among a large number of features, and rank them by importance.
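For intuition, the same decomposition is available in base R via prcomp(); a minimal sketch on synthetic data (rda(..., scale = T) in vegan corresponds to prcomp(..., scale. = TRUE)):

```r
set.seed(42)
X <- matrix(rnorm(200 * 5), ncol = 5)

pca <- prcomp(X, scale. = TRUE)                  # centre and scale each column
prop_explained <- pca$sdev^2 / sum(pca$sdev^2)   # the "Proportion Explained" row
round(prop_explained, 3)
```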

Our original dataset is very large, so we will use only a subset of it containing the complete observations. We will keep only the expression values of the various proteins and the class of each mouse. Let’s also look at how the classes are represented.

mice_set <- mouses_without_na[, -c(1, 2, 79, 80, 81,82)] 
table(mice_set$class)
## 
## c-CS-m c-CS-s c-SC-m c-SC-s t-CS-m t-CS-s t-SC-m t-SC-s 
##     45     75     60     75     90     75     60     72

Let’s do a principal component analysis for our data.

mouse_pca <- rda(mice_set[, -77], scale = T)

First, let’s find out how the components contribute. To do this, turn to the summary and visualize the Proportion Explained.

pca_summary <- summary(mouse_pca)
pca_result <- as.data.frame(pca_summary$cont)
plot_data <- as.data.frame(t(pca_result[c("Proportion Explained"),]))
plot_data$component <- c(1:75)

ggplot(plot_data, aes(component , `Proportion Explained`)) + geom_bar(stat = "identity") +ggtitle(label = "Component contribution")

The first three components explain approximately 30, 17 and 10% of the variability.

Also, to see which components make the greatest contribution, we plot the eigenvalues.

screeplot(mouse_pca, type = "lines", bstick = TRUE)

Let’s choose 4 components for analysis.
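For reference, the bstick = TRUE curve on the screeplot comes from the broken-stick null model: the expected share of variance for component k out of p is (1/p)·Σ_{i=k..p} 1/i, and components whose eigenvalues exceed this expectation are the ones worth keeping. A small sketch (`broken_stick` is a hypothetical helper; vegan exposes the same values via bstick()):

```r
# Expected proportions of variance under the broken-stick null model
broken_stick <- function(p) {
  sapply(seq_len(p), function(k) sum(1 / (k:p)) / p)
}

round(broken_stick(5), 3)  # 0.457 0.257 0.157 0.090 0.040
```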

Loading plots

Loading plots allow us to assess how the variables correlate with each other and with the principal component axes. Let’s look at several of these plots.

biplot(mouse_pca, display = "species", scaling = 2, choices = c(1,2))

biplot(mouse_pca, display = "species", scaling = 2, choices = c(1,4))

biplot(mouse_pca, display = "species", scaling = 2, choices = c(2,3))

biplot(mouse_pca, display = "species", scaling = 2, choices = c(2,4))

biplot(mouse_pca, display = "species", scaling = 2, choices = c(3,4))

Unfortunately, due to the large number of proteins, these plots are not very informative.
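When a biplot is too crowded to read, the loadings can be inspected directly instead: sorting them by absolute value shows which proteins drive each component. A sketch on synthetic data using prcomp (for the rda object above, scores(mouse_pca, display = "species") yields the analogous matrix):

```r
set.seed(1)
X <- matrix(rnorm(100 * 8), ncol = 8,
            dimnames = list(NULL, paste0("prot", 1:8)))

pca <- prcomp(X, scale. = TRUE)

# Variables ranked by the size of their loading on PC1
top_pc1 <- sort(abs(pca$rotation[, "PC1"]), decreasing = TRUE)
head(names(top_pc1), 3)
```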

Ordination in PCA

Let’s look at the ordination plots, which show the location of the individual objects (mice) in component space. It is difficult to single out distinct clusters. The most pronounced pattern appears in the ordination plot for components 2 and 4, where the mice group by the context of the experiment (CS or SC).

df_scores <- data.frame(mice_set,
  scores(mouse_pca, display = "sites", choices = c(1:5), scaling = 1))

ggplot(df_scores, aes(x = PC4, y = PC2, colour = class)) + 
  geom_point() + ggtitle("Ordination in PC4 and PC2 space")

ggplot(df_scores, aes(x = PC1, y = PC2, colour = class)) + 
  geom_point() + ggtitle("Ordination in PC1 and PC2 space")

ggplot(df_scores, aes(x = PC1, y = PC3, colour = class)) + 
  geom_point() + ggtitle("Ordination in PC1 and PC3 space")

3D Visualization

Among the packages for 3D visualization of the first three components (plot3D, rgl and plotly), I chose plotly because it is simple and intuitive and, as a result, produces an interactive plot! It is difficult to single out individual patterns in the chart. Perhaps the expression values should have been averaged for each mouse, since 72 points would be easier to distinguish on the plot.

fig <- plot_ly(df_scores, x = ~PC1, y = ~PC2, z = ~PC3, color = ~class, size = 0.1)
fig